Dataset parameters

The dataset is small enough to analyze in full without sampling. For a much larger dataset we would need to draw a sample, for example by keeping every 10th or 100th data point. There are 81 variables and 113937 observations, and several of the variables are factors. With this much data we clearly need to focus on a subset of variables. Variable_list lists all variables together with their types. Some variables stored as factors are clearly not true factors because they have too many levels; ClosedDate, for example, has 2803 levels.
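The two ideas in this paragraph, keeping every Nth row and checking how many levels a column has, can be sketched in a few lines. This is a minimal Python illustration using only the standard library; the helper names `systematic_sample` and `count_levels` and the toy rows are assumptions made for the example, not part of the original analysis.

```python
def systematic_sample(rows, step):
    """Keep every `step`-th row, starting with the first one."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return rows[::step]


def count_levels(rows, column):
    """Count how many distinct values (factor levels) a column has."""
    return len({row[column] for row in rows})


# Hypothetical toy rows standing in for the real loan records.
rows = [
    {"ClosedDate": f"2014-01-{day:02d}", "Term": 36 if day % 2 else 60}
    for day in range(1, 31)
]

sample = systematic_sample(rows, 10)
print(len(sample))                       # 3 (rows 0, 10 and 20)
print(count_levels(rows, "ClosedDate"))  # 30: one level per row, not a useful factor
print(count_levels(rows, "Term"))        # 2: behaves like a genuine factor
```

A column whose level count approaches the number of rows, as ClosedDate does here, is better treated as a date or free-text field than as a factor.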

Potential issues

This is loan performance data from Prosper, described in the accompanying variable dictionary. I would like to explore how certain variables depend on others, such as LoanStatus, EmploymentStatus, Occupation, CreditGrade, IncomeRange, LoanOriginalAmount, ProsperRating, BorrowerState, ProsperScore and ProsperPaymentsOneMonthPlusLate.

Looking at individual columns, we can see that there are far too many occupations for Occupation to be a useful factor variable. ProsperRating, CreditGrade and ProsperScore look promising, though.

Initial analysis - Prosper Rating and BorrowerAPR

We can start with the rating and APR information, as these are among the most important outcomes of the loan process. We can see below that the histogram of the rating is fairly symmetrical, with a well-defined mean of around 4.

For the Borrower APR there is a clear spike at around 36%, while the remaining APR values are spread fairly evenly.

Another way to look at the data is to determine the most frequent rate directly. Apparently it is not 36% but rather 17% (see below).

Now we focus on loans with an APR between 30% and 40%, where it is apparent that the 36% rate is the most prevalent.
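Finding the most frequent rate, both overall and within a band such as 30% to 40%, amounts to rounding each APR and counting. The sketch below is a minimal Python illustration of that idea; the function name `most_common_rate`, the rounding to whole percentage points and the APR values are all assumptions for the example, not the Prosper data.

```python
from collections import Counter


def most_common_rate(aprs, low=0.0, high=1.0):
    """Return (rate_in_percent, count) for the most frequent
    whole-percent APR in the half-open band [low, high)."""
    rounded = [round(apr * 100) for apr in aprs if low <= apr < high]
    if not rounded:
        return None
    return Counter(rounded).most_common(1)[0]


# Hypothetical APR values expressed as fractions.
aprs = [0.171, 0.174, 0.169, 0.172, 0.358, 0.359, 0.361, 0.312, 0.089]

print(most_common_rate(aprs))              # (17, 4): mode over all loans
print(most_common_rate(aprs, 0.30, 0.40))  # (36, 3): mode within the 30-40% band
```

The two calls can disagree, as they do here, because a histogram bin that looks dominant in one part of the range need not hold the overall mode.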

## Initial research on loans - Bivariate

Below is an attempt to figure out dependencies between pairs of variables by plotting a few variables against one another. There is a dependency between Borrower APR and Prosper Rating, visible both in the graph and in the correlation, and there is an interesting distribution of Borrower APR by income range. We will investigate those further.

Bivariate analysis - APR and Ratings

Now it is time to combine two variables, APR and rating. Below are histograms of Borrower APR split by ProsperRating. The widest APR distributions are for ratings between 4 and 6, while the narrowest is for rating 1. Borrowers with this rating are probably either getting the maximum possible rate or not getting a loan at all.

We are exploring the dependency between ProsperRating and BorrowerAPR, and clearly the higher the rating, the lower the APR. The fitted trend line supports this, as does the correlation coefficient of -0.96.

## [1] -0.9621513
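For reference, a correlation coefficient like the one above is the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. Below is a minimal Python sketch of that formula; the `pearson` helper and the rating and APR values are illustrative assumptions, so the result will not match the figure reported above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical data: one APR per rating, falling as the rating rises.
ratings = [1, 2, 3, 4, 5, 6, 7]
aprs = [0.36, 0.31, 0.27, 0.22, 0.18, 0.13, 0.09]

r = pearson(ratings, aprs)
print(round(r, 3))  # strongly negative, close to -1
```

A value near -1 says the relationship is close to a falling straight line; it does not by itself say how steep that line is.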

A boxplot helps to explore this dependency further. As we can see below, the median APR decreases steadily as the rating increases. There is a wide variety of rates in the lowest rating category, 1. Overall, however, borrowers with rating 1 can obtain loans at the same APR as borrowers with any other rating, even the highest rating of 7, although that case is an outlier.
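The quantities a boxplot draws can also be computed directly: the median, the quartiles, and the points lying more than 1.5 times the interquartile range beyond the quartiles. The following Python sketch assumes that common 1.5 * IQR whisker rule and uses made-up APR values per rating; the plotting library used for the original figure may compute quartiles slightly differently.

```python
from statistics import median, quantiles


def box_stats(values):
    """Median, quartiles and 1.5 * IQR outliers for one group."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"median": median(values), "q1": q1, "q3": q3, "outliers": outliers}


# Hypothetical APRs grouped by rating (not the Prosper data).
apr_by_rating = {
    1: [0.30, 0.33, 0.35, 0.36, 0.36, 0.10],
    4: [0.18, 0.20, 0.21, 0.22, 0.24],
    7: [0.06, 0.07, 0.08, 0.08, 0.09],
}

for rating, values in sorted(apr_by_rating.items()):
    print(rating, box_stats(values))
```

In this toy data the single 0.10 loan in rating 1 is flagged as an outlier, which mirrors the kind of point that lets a low-rated borrower appear at an APR typical of the best rating.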

Loan amount vs income range

We can see that the majority of loans are given to borrowers in the 75-100K income range, with very few borrowers in the 100K+ range. This is logical: higher-earning borrowers probably do not need cash loans or can get better deals at traditional banks. We will leave this exploration as it is.

Rating 1 investigation

What is going on with rating 1? Which states, professions, income levels and other characteristics does it represent? First we will create a table to focus on some of the most interesting variables. Then we will build an APR histogram on a log scale to show all the values, as APR rate for rating

### Occupations in Rating 1

We will explore the most frequent occupations among borrowers with the lowest rating, 1. There is a clear leader, but unfortunately it is “Other”, so we will exclude it.

The top occupation in rating 1 is “Professional”, followed by Administrative Assistant and Teacher.
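Ranking occupations within one rating while leaving out the catch-all “Other” category is a filter followed by a frequency count. A minimal Python sketch of that step follows; the function name `top_occupations` and the handful of rows are hypothetical, chosen only so the ordering matches the one described above.

```python
from collections import Counter


def top_occupations(rows, rating, n=3, exclude=("Other",)):
    """Most frequent occupations among loans with the given rating,
    skipping uninformative labels listed in `exclude`."""
    counts = Counter(
        row["Occupation"]
        for row in rows
        if row["ProsperRating"] == rating and row["Occupation"] not in exclude
    )
    return counts.most_common(n)


# Hypothetical rows: (occupation, rating, number of loans).
spec = [
    ("Other", 1, 4),
    ("Professional", 1, 3),
    ("Administrative Assistant", 1, 2),
    ("Teacher", 1, 1),
    ("Professional", 5, 1),
]
rows = [
    {"Occupation": occupation, "ProsperRating": rating}
    for occupation, rating, count in spec
    for _ in range(count)
]

print(top_occupations(rows, rating=1))
```

Passing `exclude=()` shows the unfiltered ranking, which is a quick way to see how much of the group the “Other” label accounts for.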

I want to explore the occupation “Professional” further, as it is very ambiguous. As you can see below, the majority of borrowers with this occupation earn from 50K to above 100K. This suggests that pay in this occupation varies, probably with the firm and experience.

I want to see how income range relates to employment status duration, but there is no meaningful pattern, as the employment status durations are not long. This is probably because the data does not cover a sufficient time span.

I have repeated the above in percentage terms. There is still no clear dependency between higher salary and employment status duration.

### New table with Occupation data

Now I am going to create a long table with the following columns: Occupation, median_BorrowerAPR, mean_ProsperRating..numeric. and median_CreditScoreRangeUpper. I will build it by aggregating the data into separate tables and then merging them. I would like to explore how different variables depend on Occupation.

## 'data.frame':    68 obs. of  4 variables:
##  $ Occupation   : Factor w/ 68 levels "","Accountant/CPA",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ CreditScore  : num  679 719 699 719 719 719 719 699 699 719 ...
##  $ ProsperRating: num  4 4 4 5 5 5 4 4 4 4 ...
##  $ BorrowerAPR  : num  0.199 0.2 0.237 0.189 0.184 ...
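The aggregate-then-merge step that produces a table like this can be sketched as follows. This is a minimal Python illustration using only the standard library, not the code that generated the output above; the helpers `aggregate` and `merge_on_key` and the four loan rows are assumptions for the example, while the column names follow those listed in the text.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate(rows, key, column, func):
    """Apply `func` to the non-missing values of `column` within each `key` group."""
    groups = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            groups[row[key]].append(row[column])
    return {group: func(values) for group, values in groups.items()}


def merge_on_key(**tables):
    """Inner-join several {key: value} tables into {key: {name: value}}."""
    shared = set.intersection(*(set(table) for table in tables.values()))
    return {k: {name: t[k] for name, t in tables.items()} for k in sorted(shared)}


# Hypothetical loans; None marks a missing rating.
loans = [
    {"Occupation": "Teacher", "BorrowerAPR": 0.20,
     "ProsperRating..numeric.": 4, "CreditScoreRangeUpper": 699},
    {"Occupation": "Teacher", "BorrowerAPR": 0.24,
     "ProsperRating..numeric.": 3, "CreditScoreRangeUpper": 719},
    {"Occupation": "Teacher", "BorrowerAPR": 0.22,
     "ProsperRating..numeric.": None, "CreditScoreRangeUpper": 679},
    {"Occupation": "Judge", "BorrowerAPR": 0.12,
     "ProsperRating..numeric.": 6, "CreditScoreRangeUpper": 759},
]

table = merge_on_key(
    CreditScore=aggregate(loans, "Occupation", "CreditScoreRangeUpper", median),
    ProsperRating=aggregate(loans, "Occupation", "ProsperRating..numeric.", mean),
    BorrowerAPR=aggregate(loans, "Occupation", "BorrowerAPR", median),
)
for occupation, row in table.items():
    print(occupation, row)
```

The inner join keeps only occupations present in every aggregate, so an occupation with no rated loans at all would drop out of the merged table rather than appear with a missing value.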

APR vs CreditScore vs Occupation

Most occupations have a median credit score just under 700 or 720, and among them the highest APRs went to Teacher’s Aide and Nurse’s Aide. Students had a lower score but a lower APR as well, which may reflect their good potential in the eyes of lenders. Surprisingly, the occupation Investor had a fairly high credit score but a high APR as well. This indicates that even though such borrowers are presumably wealthy and have good credit scores, their other parameters were probably risky.

Credit score and APR correlation

I want to see the correlation between credit score and APR, shown below. The correlation between BorrowerAPR and credit score is only moderate (-0.43), which is somewhat surprising. However, given that these are cash loans to borrowers who have exhausted more traditional banking options, it makes sense.

## [1] -0.4297073

Occupations of low APR and low Credit Score

Surprisingly, there are a few occupations that get a low APR despite having a low credit score.
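Picking out such occupations from a per-occupation summary table is a two-condition filter. The Python sketch below shows the idea; the cut-off values, the function name `low_apr_low_score` and the placeholder occupations are all assumptions for illustration, since the thresholds used for the original plot are not stated.

```python
def low_apr_low_score(table, max_apr, max_score):
    """Names of occupations whose APR and credit score are both at or below the cut-offs."""
    return sorted(
        name
        for name, row in table.items()
        if row["BorrowerAPR"] <= max_apr and row["CreditScore"] <= max_score
    )


# Hypothetical per-occupation summary (placeholder names and values).
occupation_table = {
    "Occupation A": {"BorrowerAPR": 0.17, "CreditScore": 679},
    "Occupation B": {"BorrowerAPR": 0.24, "CreditScore": 679},
    "Occupation C": {"BorrowerAPR": 0.16, "CreditScore": 739},
}

print(low_apr_low_score(occupation_table, max_apr=0.18, max_score=699))
# ['Occupation A']
```

Only the first placeholder satisfies both conditions: the second has a low score but a high APR, and the third has a low APR but a high score.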

Final Plots and Summary

APR vs CreditScore vs Occupation

Most occupations have a median credit score just under 700 or 720, and among them the highest APRs went to Teacher’s Aide and Nurse’s Aide. Students had a lower score but a lower APR as well, which may reflect their good potential in the eyes of lenders. Surprisingly, the occupation Investor had a fairly high credit score but a high APR as well. This indicates that even though such borrowers are presumably wealthy and have good credit scores, their other parameters were probably risky.

Credit score vs APR

I want to see the correlation between credit score and APR, shown below. The correlation between BorrowerAPR and credit score is only moderate (-0.43), which is somewhat surprising. However, given that these are cash loans to borrowers who have exhausted more traditional banking options, it makes sense.

## [1] -0.4297073

Occupation Professional and APR

The occupation “Professional” is very ambiguous. As you can see below, the majority of borrowers with this occupation earn from 50K to above 100K. This suggests that pay in this occupation varies, probably with the type of work and experience. It is not a great indicator of the borrower’s creditworthiness, though.

Reflection

It is a large dataset with many factor-like variables. The context of loan-related information such as APR, Credit Score and Lender Credit Score is familiar to many; however, it should be noted that this is a different type of lending. Prosper specializes in short-term, risky cash loans, and the usual assumptions may not always apply. For example, there is no strong correlation between credit score and APR. Also, some income levels and professions were not getting APRs as low as expected. This is probably because Prosper is a niche lender serving a market not covered by traditional banking, i.e. those who have exhausted other options or were not granted loans by banks.

It was interesting to explore occupation against credit ratings (both Prosper rating and credit score), and it gave many insights into the data. Next I would research what Prosper bases its rating on: what is the most important factor? Is it income level, number of loans or debt-to-income ratio? Does the borrower’s state have any bearing on the Prosper rating and APR?